Reddit is known as the “Front Page of the Internet” and is a popular forum, especially among young people, where users can post about almost any topic. Unlike on other social media platforms, the majority of Reddit users remain anonymous. We believe that the anonymity of the forum makes its posts well suited for training and testing NLP models. Reddit also has a large international community and a lot of programming-related content. We want to use data from Reddit to better understand the popularity of programming languages among its users.
Additionally, we want to compare it with data from Stack Overflow, a forum that focuses more on solving programming-related problems.
We want to evaluate which programming languages are discussed in both forums and compare them. To reach this aim, we want to use visualization and machine learning methods on the text data, and also use the quantitative signals that we get from the upvotes and the number of comments.
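As a rough illustration of the quantification idea, the following sketch counts how often a language is mentioned in post titles and sums the upvotes and comments of those posts. The language list, the regular expressions, the function name `language_scores` and the sample posts are all assumptions made up for this example; they are not the project's final method.

```python
import re
from collections import Counter

# Hypothetical language list; a real study would need a longer, curated one.
LANGUAGE_PATTERNS = {
    "Python": re.compile(r"\bpython\b", re.IGNORECASE),
    "Java": re.compile(r"\bjava\b", re.IGNORECASE),
    "JavaScript": re.compile(r"\b(?:javascript|js)\b", re.IGNORECASE),
    "Rust": re.compile(r"\brust\b", re.IGNORECASE),
}


def language_scores(posts):
    """Return (mention counts, engagement totals) per language.

    `posts` is an iterable of (title, upvotes, number_of_comments) tuples.
    Engagement is simply upvotes plus comments, an assumed weighting.
    """
    mentions = Counter()
    engagement = Counter()
    for title, ups, num_comments in posts:
        for language, pattern in LANGUAGE_PATTERNS.items():
            if pattern.search(title):
                mentions[language] += 1
                engagement[language] += ups + num_comments
    return mentions, engagement


# Made-up example posts: (title, upvotes, number of comments)
sample_posts = [
    ("Why I switched from Java to Rust", 120, 45),
    ("Python tips for beginners", 80, 10),
    ("Understanding JavaScript closures", 30, 5),
    ("Rust 2021 edition", 60, 12),
]

mentions, engagement = language_scores(sample_posts)
for language in LANGUAGE_PATTERNS:
    print(language, mentions[language], engagement[language])
```

Matching on whole words keeps "Java" from being counted inside "JavaScript". Names such as "C++" or "C#" would need dedicated patterns, because word boundaries do not work well with punctuation.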
For the Reddit posts, the plan is to use the Reddit API to retrieve data sets for a certain time range and a number of specific subreddits. The choice of subreddits is crucial for the quality and expressiveness of our data and will be based on prior research into interesting programming-related subreddits. From this data we can then get the subreddit, title, text, upvotes and various metadata.
| data.subreddit | data.title | data.id | data.created | data.created_utc | data.upvote_ratio | data.ups | data.score | data.num_comments |
|---|---|---|---|---|---|---|---|---|
| coding | Back-End VS Front-End Framework \| 6 J.S. Frameworks Experts Love - Untied Blogs | nh0yzf | 1621547972 | 1621519172 | 0.25 | 0 | 0 | 1 |
| coding | File Descriptor Limits | ngzeep | 1621543958 | 1621515158 | 0.33 | 0 | 0 | 0 |
| coding | Introduction to Continuous Profiling | ngy73c | 1621540423 | 1621511623 | 0.92 | 24 | 24 | 3 |
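A minimal sketch of how rows like the ones in the table could be extracted from an API response. It assumes the response has the usual Reddit listing shape, where each post sits under `data.children[].data`. The helper name `listing_to_rows` is hypothetical, and the embedded JSON is a hand-trimmed sample built from two of the table rows rather than a live API call.

```python
import json

# Columns kept from each post, matching the table (without the "data." prefix).
FIELDS = [
    "subreddit", "title", "id", "created", "created_utc",
    "upvote_ratio", "ups", "score", "num_comments",
]


def listing_to_rows(listing):
    """Flatten a listing dict into one dict per post, keyed like the table."""
    rows = []
    for child in listing["data"]["children"]:
        post = child["data"]
        # Missing fields become None instead of raising a KeyError.
        rows.append({f"data.{name}": post.get(name) for name in FIELDS})
    return rows


# Hand-trimmed sample payload (assumed shape), using values from the table.
sample = json.loads("""
{
  "kind": "Listing",
  "data": {
    "children": [
      {"kind": "t3", "data": {
        "subreddit": "coding", "title": "File Descriptor Limits",
        "id": "ngzeep", "created": 1621543958, "created_utc": 1621515158,
        "upvote_ratio": 0.33, "ups": 0, "score": 0, "num_comments": 0}},
      {"kind": "t3", "data": {
        "subreddit": "coding", "title": "Introduction to Continuous Profiling",
        "id": "ngy73c", "created": 1621540423, "created_utc": 1621511623,
        "upvote_ratio": 0.92, "ups": 24, "score": 24, "num_comments": 3}}
    ]
  }
}
""")

rows = listing_to_rows(sample)
for row in rows:
    print(row["data.title"], row["data.ups"], row["data.num_comments"])
```

Keeping the extraction separate from the download step means the same function can be reused for every subreddit and time range once the raw responses are stored.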